{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NVIDIA Merlin on Microsoft's News Dataset (MIND)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial notebook, we would be using the [Microsoft's News Dataset (MIND)](https://msnews.github.io/) to demonstrate NVTabular for ETL the data and HugeCTR for training Deep Neural Network models for building a Recommender System.\n",
    "\n",
    "The MIND dataset contains 15M impressions generated by 1M users over 160k news articles. Our goal from this jupyter notebook would be to train a model that can predict whether a user would click on a news article or not.\n",
    "\n",
    "In order to build a Recommender System, we would be first cleaning and pre-processing the data, then developing simple time based and complex target & count encoded features to finally train and evaluate Deep Learning Recommendation Model (DLRM)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please remember to run this jupyter notebook in the [merlin-training:22.05](https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-training) docker container."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Import libraries and create directories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install packages required for this notebook\n",
    "!pip install tqdm graphviz\n",
    "!apt install wget unzip graphviz -y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time, glob, shutil, sys, os, pickle, json\n",
    "from tqdm import tqdm\n",
    "\n",
    "import cupy as cp          # CuPy is an implementation of NumPy-compatible multi-dimensional array on GPU\n",
    "import cudf                # cuDF is an implementation of Pandas-like Dataframe on GPU\n",
    "import rmm                 # library for pre-allocating memory on GPU\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "# NVTabular is the core library we will use here for feature engineering/preprocessing on GPU\n",
    "from nvtabular.ops import Operator\n",
    "import nvtabular as nvt\n",
    "from nvtabular.utils import device_mem_size\n",
    "\n",
    "# Dask is the backend job scheduler used by NVTabular\n",
    "import dask   \n",
    "import dask_cudf\n",
    "from dask_cuda import LocalCUDACluster\n",
    "from dask.distributed import Client\n",
    "from dask.distributed import wait\n",
    "from dask.utils import parse_bytes\n",
    "from dask.delayed import delayed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is often a good idea to set-aside (fast) dedicated disk space for dask workers to spill data and logging information. To make things simple, we will perform all IO within a single `BASE_DIR` for this example. Feel free to set this variable yourself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define \"fast\" root directory for this example\n",
    "BASE_DIR = os.environ.get(\"BASE_DIR\", \"./basedir\")\n",
    "\n",
    "# Define worker/output directories\n",
    "dask_workdir = os.path.join(BASE_DIR, \"workdir\")\n",
    "\n",
    "# Directory to store the raw downloaded dataset\n",
    "data_input_path = os.path.join(BASE_DIR, \"dataset\")\n",
    "data_path_train = os.path.join(data_input_path, \"train\")\n",
    "data_path_valid = os.path.join(data_input_path, \"valid\")\n",
    "\n",
    "# Directory to store NVTabular's processed dataset\n",
    "data_output_path = os.path.join(BASE_DIR, \"processed_nvt\")\n",
    "output_train_path = os.path.join(data_output_path, \"train\")\n",
    "output_valid_path = os.path.join(data_output_path, \"valid\")\n",
    "\n",
    "# Directory to store HugeCTR's train configurations and weights\n",
    "config_output_path = os.path.join(BASE_DIR, \"configs\")\n",
    "weights_path = os.path.join(BASE_DIR, \"weights\")\n",
    "\n",
    "#Creating and cleaning our worker/output directories\n",
    "try:\n",
    "    # Ensure BASE_DIR exists\n",
    "    if not os.path.isdir(BASE_DIR):\n",
    "        os.mkdir(BASE_DIR)\n",
    "\n",
    "    # Make sure we have a clean worker space for Dask\n",
    "    if os.path.isdir(dask_workdir):\n",
    "        shutil.rmtree(dask_workdir)\n",
    "    os.mkdir(dask_workdir)\n",
    "\n",
    "    # Make sure we have a clean path for downloading dataset and preprocessing\n",
    "    if os.path.isdir(data_input_path):\n",
    "        shutil.rmtree(data_input_path)\n",
    "    os.mkdir(data_input_path)\n",
    "    os.mkdir(data_path_train)\n",
    "    os.mkdir(data_path_valid)\n",
    "\n",
    "    # Make sure we have a clean output path\n",
    "    if os.path.isdir(data_output_path):\n",
    "        shutil.rmtree(data_output_path)\n",
    "    os.mkdir(data_output_path)\n",
    "    os.mkdir(output_train_path)\n",
    "    os.mkdir(output_valid_path)\n",
    "    \n",
    "    # Make sure we have a clean configs and weights path\n",
    "    if os.path.isdir(config_output_path):\n",
    "        shutil.rmtree(config_output_path)\n",
    "    os.mkdir(config_output_path)    \n",
    "        \n",
    "    if os.path.isdir(weights_path):\n",
    "        shutil.rmtree(weights_path)\n",
    "    os.mkdir(weights_path)\n",
    "\n",
    "except OSError:\n",
    "    print (\"Creation of the directories failed\")\n",
    "else:\n",
    "    print (\"Successfully created the directories\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following directory structure has been created and would be used to store everything concerning this tutorial:\n",
    "\n",
    "basedir <br>\n",
    "&emsp; |--- workdir    \n",
    "&emsp; |--- dataset <br>\n",
    "&emsp; &emsp; |--- train <br>\n",
    "&emsp; &emsp; |--- valid  <br>\n",
    "&emsp; |--- processed_nvt <br>\n",
    "&emsp;  &emsp; |--- train <br>\n",
    "&emsp;  &emsp; |--- valid  <br>\n",
    "&emsp; |--- configs <br>\n",
    "&emsp; |--- weights <br>    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Deploy a Distributed-Dask cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the GPUs that are available to this notebook\n",
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialize Dask GPU Cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NUM_GPUS = [0,1,2,3] # Set this to the GPU IDs that are observed from the above cell\n",
    "\n",
    "# Dask dashboard\n",
    "dashboard_port = \"8787\"\n",
    "\n",
    "# Deploy a single-machine multi-GPU cluster\n",
    "protocol = \"tcp\"             # \"tcp\" or \"ucx\"\n",
    "visible_devices = \",\".join([str(n) for n in NUM_GPUS])  # Detect devices to place workers\n",
    "device_spill_frac = 0.9      # Spill GPU-Worker memory to host at this limit.\n",
    "                             # Reduce if spilling fails to prevent\n",
    "                             # device memory errors.\n",
    "\n",
    "# Get device memory capacity\n",
    "capacity = device_mem_size(kind=\"total\") \n",
    "\n",
    "# Check if any device memory is already occupied\n",
    "\"\"\"\n",
    "for dev in visible_devices.split(\",\"):\n",
    "    fmem = _pynvml_mem_size(kind=\"free\", index=int(dev))\n",
    "    used = (device_size - fmem) / 1e9\n",
    "    if used > 1.0:\n",
    "        warnings.warn(f\"BEWARE - {used} GB is already occupied on device {int(dev)}!\")\n",
    "\"\"\"\n",
    "\n",
    "cluster = None               # (Optional) Specify existing scheduler port\n",
    "if cluster is None:\n",
    "    cluster = LocalCUDACluster(\n",
    "        protocol = protocol,\n",
    "        n_workers=len(visible_devices.split(\",\")),\n",
    "        CUDA_VISIBLE_DEVICES = visible_devices,\n",
    "        device_memory_limit = capacity * device_spill_frac,\n",
    "        local_directory=dask_workdir,\n",
    "        dashboard_address=\":\" + dashboard_port,\n",
    "    )\n",
    "\n",
    "# Create the distributed client\n",
    "client = Client(cluster)\n",
    "client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize RMM pool on ALL workers\n",
    "def _rmm_pool():\n",
    "    rmm.reinitialize(\n",
    "        pool_allocator=True,\n",
    "        initial_pool_size=None, # Use default size\n",
    "    )\n",
    "    \n",
    "client.run(_rmm_pool)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3:  Download & explore MIND dataset\n",
    "\n",
    "MIcrosoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of Microsoft News website.\n",
    "\n",
    "Please read and accept the Microsoft Research License Terms before downloading.\n",
    "\n",
    "Let's download the train and validation set, and unzip them to their respective directories. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://mind201910small.blob.core.windows.net/release/MINDlarge_train.zip\n",
    "!wget https://mind201910small.blob.core.windows.net/release/MINDlarge_dev.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!unzip MINDlarge_train.zip -d $BASE_DIR/dataset/train\n",
    "!unzip MINDlarge_dev.zip -d  $BASE_DIR/dataset/valid"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The MIND dataset for news recommendation was collected from anonymized behavior logs of Microsoft News website. To protect user's privacy, each user is de-linked from the production system when securely hashed into an anonymized ID. \n",
    "MIND dataset team has randomly sampled 1M users who had at least 5 news clicks from October 12 to November 22, 2019 (6 weeks).\n",
    "\n",
    "Microsoft has provided train, validation and test sets of this data but we are going to use the train and validation set for this tutorial.\n",
    "\n",
    "### Dataset format \n",
    "\n",
    "Each set of this data contains the following 4 files:\n",
    "\n",
    "1. behaviors.tsv - The click history and impression logs of users\n",
    "2. news.tsv - Details of news articles mapped with the news ID\n",
    "3. entity_embedding.vec - The embeddings of entities in news extracted from knowledge graph\n",
    "4. relation_embedding.vec - The embeddings of relations between entities extracted from knowledge graph\n",
    "\n",
    "Let's take a look at both these TSV files and understand how we can utilise them for our Recommendation System. <br>\n",
    "Note - For the ease of this tutorial, we are ignoring the embeddings provided by the MIND team."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Behaviors data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "behaviors_train = cudf.read_csv(os.path.join(data_path_train , 'behaviors.tsv'), \n",
    "                                header=None, \n",
    "                                sep='\\t',)\n",
    "behaviors_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each row in this data file represents one instance of an impression generated by the user. The columns of behaviors data are represented as:<br>\n",
    "\n",
    "[Impression ID] [User ID] [Time until when Impression Recorded] [User Click History] [Impression News]\n",
    "\n",
    "**Column 0**: Impression ID (int64)<br>\n",
    "This is the ID of the impression generated.<br>\n",
    "e.g. 1,2,3,4,5\n",
    "        \n",
    "**Column 1**: User ID (string)<br>\n",
    "The anonymous ID of a user who has generated that impression.<br>\n",
    "e.g. U89 , U395 , U60005, U3965770\n",
    "        \n",
    "**Column 2**: Time (timestamp)<br>\n",
    "The impression time with format `MM/DD/YYYY HH:MM:SS AM/PM` <br>\n",
    "This is the point of time upto which the user's impression have been captured. \n",
    "            \n",
    "**Column 3**: History (string)<br>\n",
    "The news click history of this user before this impression. The clicked news articles are ordered by time.<br>\n",
    "e.g. N106403 N71977 N97080 N102132 N97212 N121652\n",
    "        \n",
    "**Column 4**: Impressions (string)<br>\n",
    "List of news displayed to the user and user's click behaviors on them (1 for click and 0 for non-click).<br>\n",
    "e.g. N129416-0 N26703-1 N120089-1 N53018-0 N89764-0 N91737-0 N29160-0\n",
    "   \n",
    "The corresponding details of news ID in history and impression columns would be present in the news.tsv file.\n",
    "\n",
    "For more details on dataset: Official MIND Dataset Description, click [Official Dataset Description](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md)\n",
    "\n",
    "Let's reload the data with their respective column names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "behaviors_columns = ['impression_id', 'uid', 'time', 'history', 'impressions']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "behaviors_train = cudf.read_csv(os.path.join(data_path_train , 'behaviors.tsv'), \n",
    "                          header=None, \n",
    "                          names=behaviors_columns,\n",
    "                    sep='\\t',)\n",
    "behaviors_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "behaviors_valid = cudf.read_csv(os.path.join(data_path_valid , 'behaviors.tsv'), \n",
    "                          header=None, \n",
    "                          names=behaviors_columns,\n",
    "                    sep='\\t',)\n",
    "behaviors_valid.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### News data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "news_train = cudf.read_csv(os.path.join(data_path_train , 'news.tsv'), \n",
    "                          header=None, \n",
    "                          sep='\\t',)\n",
    "news_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each row in this data file represents a news article and its attributes. The columns of this data file are:\n",
    "\n",
    "[News ID] [Category] [Subcategory] [News Title] [News Abstract] [News Url] [Entities in News Title] [Entities in News Abstract]\n",
    "\n",
    "**Column 0**: News ID (string)<br>\n",
    "This is the ID of the news article<br>\n",
    "e.g. N89 , N395 , N60005, N3965770\n",
    "        \n",
    "**Column 1**: Category (string)<br>\n",
    "Category of the news. There are 18 categories<br>\n",
    "e.g. sports , health , news ... etc\n",
    "        \n",
    "**Column 2**: SubCategory (string)<br>\n",
    "Sub-category of the news. There are 242 unique sub-categories.<br>\n",
    "e.g. golf, newsscienceandtechnology, medical, newsworld ... etc\n",
    "            \n",
    "**Column 3**: Title (string)<br>\n",
    "Title of the news article<br>\n",
    "e.g. PGA Tour winners, 50 Worst Habits For Belly Fats ... etc\n",
    "        \n",
    "**Column 4**: Abstract (string)<br>\n",
    "Abstract of the news article<br>\n",
    "e.g. A gallery of recent winners on the PGA Tour, These seemingly harmless habits are holding\n",
    "          \n",
    "**Column 5**: URL (string)<br>\n",
    "URL to the MSN site where the news article was published.<br>\n",
    "e.g. https://www.msn.com/en-us/sports/golf/pga-tour-winners/ss-AAjnQjj?ocid=chopendata\n",
    "        \n",
    "**Column 6**: Title Entities (string)<br>\n",
    "Entities present in the title\n",
    "        \n",
    "**Column 7**: Abstract Entites (string)<br>\n",
    "Entites present in the abstract\n",
    "\n",
    "Let's reload the data with their respective column names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "news_columns = ['did', 'cat', 'sub_cat', 'title', 'abstract', 'url', 'title_entities', 'abstract_entities']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "news_train = cudf.read_csv(os.path.join(data_path_train , 'news.tsv'), \n",
    "                          header=None, \n",
    "                          names=news_columns,\n",
    "                    sep='\\t',)\n",
    "news_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "news_valid = cudf.read_csv(os.path.join(data_path_valid , 'news.tsv'), \n",
    "                          header=None, \n",
    "                          names=news_columns,\n",
    "                    sep='\\t',)\n",
    "news_valid.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 : Initial pre-processing to prepare dataset for feature engineering\n",
    "\n",
    "Before we use the data in NVTabular for pre-processing and feature engineering, we have to make a few changes to make it efficient for GPU operations.<br>\n",
    "The changes we have to make in the behaviours data file are:\n",
    "   - The history column is a long string and not a list. NVTabular support multi-hot categorical features but HugeCTR parquet reader does not,.Thus we need to extend the dataframe with multiple history columns, capturing each element in this long string. While extending the history columns, we have to make sure we pick the most recent history (in reverse chronological order).\n",
    "\n",
    "  - The impression column contains a long string of unique negative and positive values for the same impression event. Each of these unique values in this column is a data point for our model to learn from. Thus, these unique positive & negative entries should be unrolled into multiple rows. The row expansion operation is not supported in NVtabular and hence we're going to perform it with cuDF.\n",
    "\n",
    "As for the news data file, we would just be using the news id, category and sub-category columns.<br>\n",
    "Their are many ways to use the other columns (title, abstract, entities etc.) as features but we would leave it up to you to explore those.\n",
    "\n",
    "In a nutshell, we are going to take the raw downloaded dataset, do these basic pre-processing using cuDF, generate a new train dataset which will then be used for further processing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pre-process 1: Drop columns from the news dataset\n",
    "\n",
    "The columns that we would drop from the news.tsv are: 'title', 'abstract', 'url', 'title_entities', 'abstract_entities'\n",
    "\n",
    "We encourage you to explore using 'title_entities' and 'abstract_entities' as categorical features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "news_train = news_train.drop(['title', 'abstract', 'url', 'title_entities', 'abstract_entities'],axis = 1)\n",
    "news_valid = news_valid.drop(['title', 'abstract', 'url', 'title_entities', 'abstract_entities'],axis = 1)\n",
    "\n",
    "# Merging news train/valid dataset to have a single view of news and their attributes\n",
    "news = cudf.concat([news_train,news_valid]).drop_duplicates().reset_index().drop(['index'],axis=1)\n",
    "\n",
    "# Freeing up memory by nulling the variables\n",
    "news_train = None\n",
    "news_valid = None\n",
    "\n",
    "news.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Pre-process 2:  Label encoding for categorical variables\n",
    "\n",
    "Strings require significant amount of memory as compared to integers. As an example, representing the string `cfcd208495d565ef66e7dff9f98764` as integer `0` can save upto 90% memory.\n",
    "\n",
    "Thus, we would be label encoding the categorical variables in our dataset so that the downstream pre-preprocessing and feature engineering pipelines doesn't consume a high amount of memory.<br>\n",
    "We will also label encode low cardinality columns in news.tsv like the news_categories and news_subcategories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Encoding user id from both train and validation dataframe\n",
    "user_index = {} \n",
    "\n",
    "temp = cudf.concat([behaviors_train['uid'],behaviors_valid['uid']]).unique().to_pandas() \n",
    "for i in tqdm(range(len(temp)),total = len(temp)):\n",
    "    user_index[temp[i]] = i + 1    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replacing uid in the dataset with their respective indexes\n",
    "\n",
    "behaviors_train['uid'] = behaviors_train['uid'].replace([i for i in user_index],[str(user_index[i]) for i in user_index]).astype('int')\n",
    "behaviors_valid['uid'] = behaviors_valid['uid'].replace([i for i in user_index],[str(user_index[i]) for i in user_index]).astype('int')\n",
    "\n",
    "# Freeing up memory by nulling variables\n",
    "user_index = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Encoding news id from the combined news dataframe\n",
    "news_index = {}\n",
    "\n",
    "for n,data in tqdm(news.to_pandas().iterrows(),total = len(news)):\n",
    "    news_index[data['did']] = n + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Encoding new's category and subcategories\n",
    "\n",
    "cat = {}\n",
    "subcat = {}\n",
    "\n",
    "temp = news['cat'].unique()\n",
    "for i in tqdm(range(len(temp)),total = len(temp)):\n",
    "    cat[temp[i]] = i + 1\n",
    "    \n",
    "temp = news['sub_cat'].unique()\n",
    "for i in tqdm(range(len(temp)),total = len(temp)):\n",
    "    subcat[temp[i]] = i + 1\n",
    "\n",
    "# Replacing did, cat and sub_cate with their respective indexes in the news dataframe\n",
    "news = news.replace({'did': [i for i in news_index], 'cat': [i for i in cat], 'sub_cat': [i for i in subcat]},{'did': [str(news_index[i]) for i in news_index], 'cat': [str(cat[i]) for i in cat], 'sub_cat': [str(subcat[i]) for i in subcat]}).astype('int')\n",
    "news = news.set_index('did').to_pandas().T.to_dict()\n",
    "\n",
    "# Freeing up memory by nulling variables\n",
    "temp = None\n",
    "cat = None\n",
    "subcat = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will replace the news id with their corresponding news_index in the behaviours dataframe in the pre-process step 3."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Pre-process 3: Unroll items in history column\n",
    "\n",
    "As an example, consider the below row in behaviours dataframe\n",
    "\n",
    "|impression_id | uid | time | history | impressions |\n",
    "| :-: | :-: | :-: |:-: |:-: |\n",
    "| 1 | U64099 | 11/19/2019 11:37:45 AM |\tN121133 N104200 N43255 N55860 N128965 N38014 | N78206-0 N26368-1 N7578-1 N58592-0 N19858-0 |\n",
    "\n",
    "We have to convert one history column with many news id to multiple history columns with single news id. \n",
    "\n",
    "| hist_0 | hist_1 | hist_2 | hist_3 | hist_4 | hist_5 |\n",
    "| :-: | :-: | :-: | :-: | :-: | :-: |\n",
    "|\tN121133 | N104200 | N43255 | N55860 | N128965 | N38014 |\n",
    "\n",
    "Finally, we will add the news category and subcategory for these news ids. The row after these transformations would look like this:\n",
    "\n",
    "|impression_id | uid | time | hist_cat_0 | hist_cat_1 | hist_cat_2 | ... | hist_subcat_3 | hist_subcat_4 | hist_subcat_5 | impressions |\n",
    "| :-: | :-: | :-: |:-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |\n",
    "| 1 | U64099 | 11/19/2019 11:37:45 AM |\tsports | finance | entertainment | ... | markets | celebrity | football_nfl | N78206-0 N26368-1 N7578-1 N58592-0 N19858-0 |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observed that the maximum number of items in the history column is 400 & the mean is 30.<br>\n",
    "We have to unroll the same number of history items for each row and thus would define a variable that will control this number. Feel free to increase this number to include more items.\n",
    "\n",
    "For this tutorial, `max_hist` i.e. the number of history columns to be unrolled is set to 10. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_hist = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets expand the history column into individual columns of histories with the limit `max_hist`. \n",
    "During expansion, we will use the last `max_hist` items from history column as those items would be the most recent ones (since the news id in this column is ordered by time).\n",
    "\n",
    "In addition, we're also saving the length of history in a seperate column which could be used as a feature too.\n",
    "\n",
    "We will also replace the news id with their news indexes in the behaviours dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Making a new gdf for storing history\n",
    "hist = cudf.DataFrame() \n",
    "\n",
    "# Splitting the long string of history into several columns\n",
    "hist[['hist_'+str(i) for i in range(max_hist)]] = behaviors_train.history.str.rsplit(n=max_hist,expand=True).fillna(0)[[i for i in range(1,max_hist+1)]]\n",
    "\n",
    "# Replacing string news id in history with respective indexes\n",
    "hist = hist.replace([i for i in news_index],[str(news_index[i]) for i in news_index]).astype('int')\n",
    "\n",
    "# Appending news category corresponding to these newly created history columns\n",
    "behaviors_train[['hist_cat_'+str(i) for i in range(max_hist)]] = hist.replace([int(i) for i in news],[int(news[i]['cat']) for i in news])\n",
    "\n",
    "# Appending news sub-category corresponding to these newly created history columns\n",
    "behaviors_train[['hist_subcat_'+str(i) for i in range(max_hist)]] = hist.replace([int(i) for i in news],[int(news[i]['sub_cat']) for i in news])\n",
    "\n",
    "# Creating a column for the length of history \n",
    "behaviors_train['hist_count'] = behaviors_train.history.str.count(\" \")+1\n",
    "\n",
    "# Dropping the long string history column\n",
    "behaviors_train = behaviors_train.drop(['history'],axis=1)\n",
    "\n",
    "# Freeing up memory by nulling variables\n",
    "hist = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Repeating the same for validation set\n",
    "hist = cudf.DataFrame()\n",
    "\n",
    "hist[['hist_'+str(i) for i in range(max_hist)]] = behaviors_valid.history.str.rsplit(n=max_hist,expand=True).fillna(0)[[i for i in range(1,max_hist+1)]]\n",
    "\n",
    "hist = hist.replace([i for i in news_index],[str(news_index[i]) for i in news_index]).astype('int')\n",
    "\n",
    "behaviors_valid[['hist_cat_'+str(i) for i in range(max_hist)]] = hist.replace([int(i) for i in news],[int(news[i]['cat']) for i in news])\n",
    "\n",
    "behaviors_valid[['hist_subcat_'+str(i) for i in range(max_hist)]] = hist.replace([int(i) for i in news],[int(news[i]['sub_cat']) for i in news])\n",
    "\n",
    "behaviors_valid['hist_count'] = behaviors_valid.history.str.count(\" \")+1\n",
    "\n",
    "behaviors_valid = behaviors_valid.drop(['history'],axis=1)\n",
    "\n",
    "hist = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pre-process 4 : Unroll items in impression column\n",
    "\n",
    "As an example, consider the below expanded history column row from the behaviours dataframe:\n",
    "\n",
    "|impression_id | uid | time | hist_cat_0 | hist_cat_1 | hist_cat_2 | ... | hist_subcat_3 | hist_subcat_4 | hist_subcat_5 | impressions |\n",
    "| :-: | :-: | :-: |:-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |\n",
    "| 1 | U64099 | 11/19/2019 11:37:45 AM |\tsports | finance | entertainment | ... | markets | celebrity | football_nfl | N78206-0 N26368-1 N7578-1 N58592-0 N19858-0 |\n",
    "\n",
    "The impression column contains the positive and negetive samples as a long string.\n",
    "\n",
    "After unrolling one row of impressions into multiple rows, the resulting dataframe would look like this:\n",
    "\n",
    "|impression_id | uid | time | hist_cat_0 | hist_cat_1 | hist_cat_2 | ... | hist_subcat_3 | hist_subcat_4 | hist_subcat_5 | impressions | label |\n",
    "| :-: | :-: | :-: |:-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |\n",
    "| 1 | U64099 | 11/19/2019 11:37:45 AM |\tsports | finance | entertainment | ... | markets | celebrity | football_nfl | N78206 | 0 |\n",
    "| 1 | U64099 | 11/19/2019 11:37:45 AM |\tsports | finance | entertainment | ... | markets | celebrity | football_nfl | N26368 | 1 |\n",
    "| 1 | U64099 | 11/19/2019 11:37:45 AM |\tsports | finance | entertainment | ... | markets | celebrity | football_nfl | N7578 | 1 |\n",
    "| 1 | U64099 | 11/19/2019 11:37:45 AM |\tsports | finance | entertainment | ... | markets | celebrity | football_nfl | N58592 | 0 |\n",
    "| 1 | U64099 | 11/19/2019 11:37:45 AM |\tsports | finance | entertainment | ... | markets | celebrity | football_nfl | N19858 | 0 |\n",
    "\n",
    "Note that all the 5 generated rows have the same impression_id, uid, time and history data columns."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have observed that the maximum number of items in impression column is 105 and the mean is 40. <br>\n",
    "We will limit the items to unroll from impression column by defining the variable `max_impr` and set it to 100. Feel free to increase or decrease this value.\n",
    "\n",
    "**Note** - Make sure you're using a GPU with atleast 16GB memory to avoid OOM errors with the below set values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_impr = 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Row expansion is a memory and I/O intensive operation thus, we will perform it in 2 steps. We will first create a dictionary with impression-label and later merge it with the train set.\n",
    "\n",
    "Let's convert impression column as dictionary of list with impression id as key and the impression items as value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For train dataset\n",
    "impr_train = behaviors_train.set_index('impression_id').impressions.to_pandas().str.split()\n",
    "impr_train = impr_train.to_dict()\n",
    "behaviors_train = behaviors_train.drop(['impressions'],axis=1)\n",
    "\n",
    "# For validation dataset\n",
    "impr_valid = behaviors_valid.set_index('impression_id').impressions.to_pandas().str.split()\n",
    "impr_valid = impr_valid.to_dict()\n",
    "behaviors_valid = behaviors_valid.drop(['impressions'],axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the number of negative samples (labelled with 0) are greater than the positive samples, we can define a ratio between the negatives and positives to sampling a balanced distribution. <br>\n",
    "For now, let's set this variable to -1 to include all the samples from the impression column. Feel free to set this variable to a value greater than 1 to downsample the negative samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np_ratio = -1 # ratio of neg-to-pos samples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Iterating over the above dictionary to create a new dataframe with individual impression news in a new row with its corresponding label.<br>\n",
    "This is a time consuming operation!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# For train set\n",
    "\n",
    "imp_id = []\n",
    "imp_list = []\n",
    "imp_label = []\n",
    "for i in tqdm(impr_train,total = len(impr_train)):\n",
    "    imp, label = np.transpose([[news_index[imp.split('-')[0]],imp.split('-')[1]] for imp in impr_train[i]])\n",
    "    pos = (label == '1').sum()\n",
    "    neg = 0\n",
    "    for j in range(min(len(imp),max_impr)):\n",
    "        if label[j] == '0' and np_ratio > -1:\n",
    "            if neg <= pos*np_ratio :\n",
    "                imp_id.append(i)\n",
    "                imp_list.append(imp[j])\n",
    "                imp_label.append(label[j])\n",
    "                neg+=1\n",
    "        else:\n",
    "            imp_id.append(i)\n",
    "            imp_list.append(imp[j])\n",
    "            imp_label.append(label[j])\n",
    "\n",
    "impr_train = None \n",
    "\n",
    "# Creating a new gdf with impression id, news id and its label\n",
    "impressions_train = cudf.DataFrame({'imp_id': imp_id,'impr': imp_list,'label': imp_label})\n",
    "\n",
    "# Appending news category corresponding to above impression news in the above created DataFrame\n",
    "impressions_train['impr_cat'] = impressions_train['impr'].replace([int(i) for i in news],[int(news[i]['cat']) for i in news])\n",
    "\n",
    "# Appending news sub-category corresponding to above impression news in above created DataFrame\n",
    "impressions_train['impr_subcat'] = impressions_train['impr'].replace([int(i) for i in news],[int(news[i]['sub_cat']) for i in news])\n",
    "\n",
    "# Droping impr columns as news data is added for it.\n",
    "impressions_train = impressions_train.drop(['impr'],axis=1)\n",
    "\n",
    "impressions_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For validation set\n",
    "\n",
    "imp_id = []\n",
    "imp_list = []\n",
    "imp_label = []\n",
    "for i in tqdm(impr_valid,total = len(impr_valid)):\n",
    "    imp, label = np.transpose([[news_index[imp.split('-')[0]],imp.split('-')[1]] for imp in impr_valid[i]])\n",
    "    pos = (label == '1').sum()\n",
    "    neg = 0\n",
    "    for j in range(min(len(imp),max_impr)):\n",
    "        if label[j] == '0' and np_ratio > -1:\n",
    "            if neg <= pos*np_ratio :\n",
    "                imp_id.append(i)\n",
    "                imp_list.append(imp[j])\n",
    "                imp_label.append(label[j])\n",
    "                neg+=1\n",
    "        else:\n",
    "            imp_id.append(i)\n",
    "            imp_list.append(imp[j])\n",
    "            imp_label.append(label[j])\n",
    "\n",
    "impr_valid = None \n",
    "\n",
    "impressions_valid = cudf.DataFrame({'imp_id': imp_id,'impr': imp_list,'label': imp_label})\n",
    "impressions_valid['impr_cat'] = impressions_valid['impr'].replace([int(i) for i in news],[int(news[i]['cat']) for i in news])\n",
    "impressions_valid['impr_subcat'] = impressions_valid['impr'].replace([int(i) for i in news],[int(news[i]['sub_cat']) for i in news])\n",
    "impressions_valid = impressions_valid.drop(['impr'],axis=1)\n",
    "impressions_valid.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Freeing up memory by nulling variables\n",
    "imp_id = None\n",
    "imp_list = None\n",
    "imp_label = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pre-process 5: Merge behaviour and news datasets \n",
    "\n",
    "Collating all the required columns from both behaviours and news dataset would make the feature engineering process much more faster. \n",
    "\n",
    "We will merge the history columns (from behaviors dataframe) with the above created impression data and save it as a parquet file. <br>\n",
    "We will also re-initialize RMM to allow us to perform memory intensive merge operation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# For training set\n",
    "rmm.reinitialize(managed_memory=True)\n",
    "\n",
    "final_data = impressions_train.merge(behaviors_train,left_on = ['imp_id'],right_on = ['impression_id']).drop(['imp_id'],axis=1)\n",
    "final_data = cudf.concat([final_data.drop(['time'],axis=1).astype('int'),final_data['time']],axis=1)\n",
    "final_data.to_parquet(os.path.join(data_input_path, 'train.parquet'), compression = None)\n",
    "\n",
    "# Freeing up memory by nulling variables\n",
    "final_data=None\n",
    "impressions_train = None\n",
    "behaviors_train = None\n",
    "\n",
    "#client.run(_rmm_pool)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For validation set\n",
    "\n",
    "final_data = impressions_valid.merge(behaviors_valid,left_on = ['imp_id'],right_on = ['impression_id']).drop(['imp_id'],axis=1)\n",
    "final_data = cudf.concat([final_data.drop(['time'],axis=1).astype('int'),final_data['time']],axis=1)\n",
    "final_data.to_parquet(os.path.join(data_input_path, 'valid.parquet'),compression = None)\n",
    "\n",
    "# Freeing up memory by nulling variables\n",
    "final_data=None\n",
    "impressions_valid = None\n",
    "behaviors_valid = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we have our initial pre-processed data - **train.parquet** and **valid.parquet** - that would be used for feature engineering and further processing. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Feature Engineering - time-based features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get started with NVTabular, we'll first use it for creating simple time based features that would be extracted from the timestamp column in the behaviours data. <br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Declaring features of train set that we created\n",
    "\n",
    "cat_features = [\n",
    " 'hist_cat_0',\n",
    " 'hist_subcat_0',\n",
    " 'hist_cat_1',\n",
    " 'hist_subcat_1',\n",
    " 'hist_cat_2',\n",
    " 'hist_subcat_2',\n",
    " 'hist_cat_3',\n",
    " 'hist_subcat_3',\n",
    " 'hist_cat_4',\n",
    " 'hist_subcat_4',\n",
    " 'hist_cat_5',\n",
    " 'hist_subcat_5',\n",
    " 'hist_cat_6',\n",
    " 'hist_subcat_6',\n",
    " 'hist_cat_7',\n",
    " 'hist_subcat_7',\n",
    " 'hist_cat_8',\n",
    " 'hist_subcat_8',\n",
    " 'hist_cat_9',\n",
    " 'hist_subcat_9',\n",
    " 'impr_cat',\n",
    " 'impr_subcat',\n",
    " 'impression_id',\n",
    " 'uid']\n",
    "\n",
    "cont_features = ['hist_count']\n",
    "\n",
    "labels = ['label']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creating time based features by extracting the relevant elements using cuDF\n",
    "\n",
    "datetime = nvt.ColumnGroup(['time']) >> (lambda col: cudf.to_datetime(col,format=\"%m/%d/%Y %I:%M:%S %p\"))\n",
    "\n",
    "hour = datetime >> (lambda col: col.dt.hour) >> nvt.ops.Rename(postfix = '_hour')\n",
    "minute = datetime >> (lambda col: col.dt.minute) >> nvt.ops.Rename(postfix = '_minute')\n",
    "seconds = datetime >> (lambda col: col.dt.second) >> nvt.ops.Rename(postfix = '_second')\n",
    "\n",
    "weekday = datetime >> (lambda col: col.dt.weekday) >> nvt.ops.Rename(postfix = '_wd')\n",
    "day = datetime >> (lambda col: cudf.to_datetime(col, unit='s').dt.day) >> nvt.ops.Rename(postfix = '_day')\n",
    "\n",
    "week = day >> (lambda col: (col/7).floor().astype('int')) >> nvt.ops.Rename(postfix = '_week')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To create embedding tables and segregate the pre-processing functions (normalization, fill missing values etc.) among categorical and continous features, we define and pipeline them using NVTabular's operator overloading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_features = cat_features + hour + minute + seconds + weekday + day + week + datetime >> nvt.ops.Categorify(out_path = data_output_path)\n",
    "cont_features = cont_features >> nvt.ops.FillMissing() >> nvt.ops.NormalizeMinMax()\n",
    "labels = ['label']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can visualize the complete workflow pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output = cat_features + cont_features\n",
    "output.graph"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run this graph, we would create a workflow object that calculates statistics and performs the relevant transformations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "proc = nvt.Workflow(cat_features + cont_features + labels[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize a nvt.Dataset from parquet file that was created in step 4.\n",
    "\n",
    "data_train = nvt.Dataset(os.path.join(data_input_path, \"train.parquet\"), engine=\"parquet\",part_size=\"256MB\")\n",
    "data_valid = nvt.Dataset(os.path.join(data_input_path, \"valid.parquet\"), engine=\"parquet\",part_size=\"256MB\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we are going to train the DNNs using HugeCTR, we need to conform to the following dtypes:\n",
    "\n",
    "- categorical feature columns in int64\n",
    "- continuous feature columns in float32\n",
    "- label columns in float32\n",
    "    \n",
    "We will make a dictionary containing names of columns as key and the required datatype as value. This dictionary will be used by NVTabular for type casting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict_dtypes={}\n",
    "\n",
    "for col in cat_features.columns:\n",
    "    dict_dtypes[col] = np.int64\n",
    "\n",
    "for col in cont_features.columns:\n",
    "    dict_dtypes[col] = np.float32\n",
    "\n",
    "for col in labels:\n",
    "    dict_dtypes[col] = np.float32"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's fit the workflow on the training set to record the statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "proc.fit(data_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we apply the transformation to the dataset and persist it to disk as parquet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "# For training set\n",
    "proc.transform(data_train).to_parquet(output_path= output_train_path,\n",
    "                                shuffle=nvt.io.Shuffle.PER_PARTITION,\n",
    "                                dtypes=dict_dtypes,\n",
    "                                out_files_per_proc=10,\n",
    "                                cats = cat_features.columns,\n",
    "                                conts = cont_features.columns,\n",
    "                                labels = labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "# For validation set\n",
    "proc.transform(data_valid).to_parquet(output_path= output_valid_path,\n",
    "                                shuffle=nvt.io.Shuffle.PER_PARTITION,\n",
    "                                dtypes=dict_dtypes,\n",
    "                                out_files_per_proc=10,\n",
    "                                cats = cat_features.columns,\n",
    "                                conts = cont_features.columns,\n",
    "                                labels = labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load the NVTabular processed parquet files and look at our first NVTabular pre-processed dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df = dask_cudf.read_parquet(os.path.join(output_train_path, '*.parquet'))\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After transformation and persisting the data on the disk, the following files will be created:\n",
    "   1. parquet\n",
    "       - The number of parquet files depends on `out_files_per_proc` in `proc_train.transform()` \n",
    "   2. _file_list.txt\n",
    "       - The 1st line contains the number of parquet files\n",
    "       - The subsequent lines are the paths to each parquet file.\n",
    "   3. _metadata.json\n",
    "       - This file is used by HugeCTR in parsing the processed parquet files.\n",
    "       - 'file_stats' contains the name of the parquet files and their corresponding number of rows.\n",
    "       - 'cats' is a list of categorical features/columns in the dataset and their index.\n",
    "       - 'conts' is a list of continous/dense columns in the dataset and their index.\n",
    "       - 'labels' is a list of labels in the dataset and their index.\n",
    "       - This file shouldn't be edited manually."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the contents of _metadata.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "with open(os.path.join(output_train_path, '_metadata.json'),'r') as f:\n",
    "    metadata = json.load(f)\n",
    "\n",
    "metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we need to get the embedding size for the categorical variables. This will be an important input for defining the embedding table size to be used by HugeCTR."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nvtabular.ops import get_embedding_sizes\n",
    "embeddings_simple_time = get_embedding_sizes(proc)\n",
    "embeddings_simple_time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reformatting the above output for ease of copy paste in HugeCTRs config.json\n",
    "\n",
    "embedding_size_str_simple_time = [embeddings_simple_time[x][0] for x in cat_features.columns]\n",
    "embedding_size_str_simple_time"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also check the name of the categorical and continuous features that we've defined. This should match with the cats and conts dictionaries in the _metadata.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_features.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cont_features.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before moving on to training a DNN, let's try few complex feature engineering techniques using NVTabular. We would later train DNNs on both these feature engineered dataset and compare their performances."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Feature Engineering - count and target encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now perform count and target encoding on the processed dataset generated in step 5. Let's start by defining directories for the input dataset and the output processed dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Define our worker/output directories\n",
    "dask_workdir = os.path.join(BASE_DIR, \"workdir\")\n",
    "\n",
    "# Mapping our processed_nvt output directories as input directories for new workflow.\n",
    "data_input_path = os.path.join(BASE_DIR, \"dataset\")\n",
    "\n",
    "# Defining new directories for output\n",
    "data_output_path = os.path.join(BASE_DIR, \"processed_ce-te\")\n",
    "output_train_path = os.path.join(data_output_path, \"train\")\n",
    "output_valid_path = os.path.join(data_output_path, \"valid\")\n",
    "\n",
    "# Creating and cleaning our worker/output directories\n",
    "try:\n",
    "    # Ensure BASE_DIR exists\n",
    "    if not os.path.isdir(BASE_DIR):\n",
    "        os.mkdir(BASE_DIR)\n",
    "        \n",
    "    # Make sure we have a clean worker space for Dask\n",
    "    if os.path.isdir(dask_workdir):\n",
    "        shutil.rmtree(dask_workdir)\n",
    "    os.mkdir(dask_workdir)\n",
    "\n",
    "    # Make sure we have a clean output path for our new dataset\n",
    "    if os.path.isdir(data_output_path):\n",
    "        shutil.rmtree(data_output_path)\n",
    "        \n",
    "    os.mkdir(data_output_path)\n",
    "    os.mkdir(output_train_path)\n",
    "    os.mkdir(output_valid_path)\n",
    "\n",
    "except OSError:\n",
    "    print (\"Creation of the directories failed\")\n",
    "else:\n",
    "    print (\"Successfully created the directories\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you would observe, we have created a new directory by the name `processed_ce-te`. The complete directory structure now is:\n",
    "\n",
    "basedir <br>\n",
    "&emsp; |--- workdir    \n",
    "&emsp; |--- dataset <br>\n",
    "&emsp; &emsp; |--- train <br>\n",
    "&emsp; &emsp; |--- valid  <br>\n",
    "&emsp; |--- processed_nvt <br>\n",
    "&emsp;  &emsp; |--- train <br>\n",
    "&emsp;  &emsp; |--- valid  <br>\n",
    "&emsp; |--- processed_ce-te <br>\n",
    "&emsp;  &emsp; |--- train <br>\n",
    "&emsp;  &emsp; |--- valid  <br>\n",
    "&emsp; |--- configs <br>\n",
    "&emsp; |--- weights <br>    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, defining the categorical and continous features based on processed data generated in step-5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_features = ['hist_cat_0',\n",
    " 'hist_subcat_0',\n",
    " 'hist_cat_1',\n",
    " 'hist_subcat_1',\n",
    " 'hist_cat_2',\n",
    " 'hist_subcat_2',\n",
    " 'hist_cat_3',\n",
    " 'hist_subcat_3',\n",
    " 'hist_cat_4',\n",
    " 'hist_subcat_4',\n",
    " 'hist_cat_5',\n",
    " 'hist_subcat_5',\n",
    " 'hist_cat_6',\n",
    " 'hist_subcat_6',\n",
    " 'hist_cat_7',\n",
    " 'hist_subcat_7',\n",
    " 'hist_cat_8',\n",
    " 'hist_subcat_8',\n",
    " 'hist_cat_9',\n",
    " 'hist_subcat_9',\n",
    " 'impr_cat',\n",
    " 'impr_subcat',\n",
    " 'impression_id',\n",
    " 'uid',]\n",
    "\n",
    "cont_features = ['hist_count']\n",
    "\n",
    "labels = ['label']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Count Encoding** calculates the frequency of one or more categorical features. For the purpose of this tutorial, we will count how often the user had clicked on news with the same category/sub-category in a given impression.\n",
    "\n",
    "To calculate the occurence of the same news category/sub-category in history, we will iterate over the group of rows with the same impression id. We will also consider the category/sub-category of the impression news.<br>\n",
    "Let's start by defining supportive functions for counting the category and subcategory from history columns. This supportive function will be used by `apply_rows()` in LambdaOp `create_count_features`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also limit the number of history columns to be considered for count encoding. For now, let's use all the history columns that we have in the dataset i.e. all 10."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_hist = 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def add_cat_count(\n",
    "         hist_cat_0,\n",
    "         hist_cat_1,\n",
    "         hist_cat_2,\n",
    "         hist_cat_3,\n",
    "         hist_cat_4,\n",
    "         hist_cat_5,\n",
    "         hist_cat_6,\n",
    "         hist_cat_7,\n",
    "         hist_cat_8,\n",
    "         hist_cat_9,\n",
    "         impr_cat,\n",
    "         impr_cat_count,\n",
    "         k):\n",
    "    \n",
    "    # Following loop iterates over each row of columns hist_cat_0->9 and impr_cat\n",
    "    for i, temp in enumerate(zip(hist_cat_0,\n",
    "                                 hist_cat_1,\n",
    "                                 hist_cat_2,\n",
    "                                 hist_cat_3,\n",
    "                                 hist_cat_4,\n",
    "                                 hist_cat_5,\n",
    "                                 hist_cat_6,\n",
    "                                 hist_cat_7,\n",
    "                                 hist_cat_8,\n",
    "                                 hist_cat_9,\n",
    "                                 impr_cat,\n",
    "                                )):\n",
    "        \n",
    "        # Iterate over each column and check if history category matches with impression category.\n",
    "        for j in temp[:-1]:\n",
    "            if j == temp[-1]:\n",
    "                k += 1\n",
    "        \n",
    "        # Update the count in the corresponding row of output column (impr_cat_count)\n",
    "        impr_cat_count[i] = k"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def add_subcat_count(\n",
    "         hist_subcat_0,\n",
    "         hist_subcat_1,\n",
    "         hist_subcat_2,\n",
    "         hist_subcat_3,\n",
    "         hist_subcat_4,\n",
    "         hist_subcat_5,\n",
    "         hist_subcat_6,\n",
    "         hist_subcat_7,\n",
    "         hist_subcat_8,\n",
    "         hist_subcat_9,\n",
    "         impr_subcat,\n",
    "         impr_subcat_count,\n",
    "         k):\n",
    "    \n",
    "    # Following loop iterates over each row of columns hist_subcat_0->9 and impr_cat\n",
    "    for i, temp in enumerate(zip(\n",
    "                                 hist_subcat_0,\n",
    "                                 hist_subcat_1,\n",
    "                                 hist_subcat_2,\n",
    "                                 hist_subcat_3,\n",
    "                                 hist_subcat_4,\n",
    "                                 hist_subcat_5,\n",
    "                                 hist_subcat_6,\n",
    "                                 hist_subcat_7,\n",
    "                                 hist_subcat_8,\n",
    "                                 hist_subcat_9,\n",
    "                                 impr_subcat,\n",
    "                                )):\n",
    "\n",
    "        # Iterate over each column and check if history sub-category matches with impression sub-category.\n",
    "        for j in temp[:-1]:\n",
    "            if j == temp[-1]:\n",
    "                k += 1      \n",
    "        # Update the count(occurence) in corresponding row of output column (impr_cat_count)        \n",
    "        impr_subcat_count[i] = k"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To add the count encoding for 'categories' and 'sub_categories' to each row for their corresponding news_id, we will write a LambdaOp by simply inhereting from NVTabular's `Operator` class and defining the `transform` and `output_column_names` methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class create_count_features(Operator):\n",
    "    def transform(self, columns, gdf):\n",
    "        if columns[-1] == 'impr_cat':\n",
    "            gdf = gdf.apply_rows(add_cat_count,incols = ['hist_cat_{}'.format(i) for i in range(max_hist)]+['impr_cat'],outcols = {'impr_cat_count': np.int64},kwargs={'k': 0})\n",
    "            return(gdf.drop(columns,axis=1))\n",
    "        if columns[-1] == 'impr_subcat':\n",
    "            gdf = gdf.apply_rows(add_subcat_count,incols = ['hist_subcat_{}'.format(i) for i in range(max_hist)]+['impr_subcat'],outcols = {'impr_subcat_count': np.int64},kwargs={'k': 0})\n",
    "            return(gdf.drop(columns,axis=1))\n",
    "\n",
    "    def output_column_names(self, columns):\n",
    "        col = []\n",
    "        if columns[-1] == 'impr_cat':\n",
    "            col.append('impr_cat_count')\n",
    "        if columns[-1] == 'impr_subcat':\n",
    "            col.append('impr_subcat_count')\n",
    "        return col\n",
    "\n",
    "    def dependencies(self):\n",
    "        return None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Target encoding** is used to average the target value by some category/group. This technique is used to find numeric mean relationship between the categorical features and target.\n",
    "\n",
    "We have observed that the hist_cat columns are the most suitable for target encoding. Rather than using just 1 history category column, we also found that a group of history columns encode better probabilities with the target variable. \n",
    "\n",
    "For this tutorial, we are going use 5 history category columns, in a moving window fashion, along with the impression category column to calculate the target encoding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "te_columns = [['hist_cat_'+str(j) for j in range(i-5+1, i+1)] + ['impr_cat'] for i in range(4, max_hist)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_encode = (\n",
    "    te_columns >>\n",
    "    nvt.ops.TargetEncoding(\n",
    "        ['label'],\n",
    "        out_path = BASE_DIR,\n",
    "        kfold=5,\n",
    "        p_smooth=20,\n",
    "        out_dtype=\"float32\",\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll also create the time based features in the same way as we did in step-6."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "datetime = nvt.ColumnGroup(['time']) >> (lambda col: cudf.to_datetime(col,format=\"%m/%d/%Y %I:%M:%S %p\"))\n",
    "\n",
    "hour = datetime >> (lambda col: col.dt.hour) >> nvt.ops.Rename(postfix = '_hour')\n",
    "minute = datetime >> (lambda col: col.dt.minute) >> nvt.ops.Rename(postfix = '_minute')\n",
    "seconds = datetime >> (lambda col: col.dt.second) >> nvt.ops.Rename(postfix = '_second')\n",
    "\n",
    "weekday = datetime >> (lambda col: col.dt.weekday) >> nvt.ops.Rename(postfix = '_wd')\n",
    "day = datetime >> (lambda col: cudf.to_datetime(col, unit='s').dt.day) >> nvt.ops.Rename(postfix = '_day')\n",
    "\n",
    "week = day >> (lambda col: (col/7).floor().astype('int')) >> nvt.ops.Rename(postfix = '_week')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_count_encode = ['hist_cat_{}'.format(i) for i in range(max_hist)] + ['impr_cat'] >> create_count_features()\n",
    "\n",
    "subcat_count_encode = ['hist_subcat_{}'.format(i) for i in range(max_hist)] + ['impr_subcat'] >> create_count_features()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_features = cat_features + datetime + hour + minute + seconds + weekday + day + week >> nvt.ops.Categorify(out_path = data_output_path)\n",
    "cont_features = cont_features + cat_count_encode + subcat_count_encode >> nvt.ops.FillMissing() >> nvt.ops.NormalizeMinMax()\n",
    "cont_features += target_encode >> nvt.ops.Rename(postfix = '_TE')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can visualize the complete workflow pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output = cat_features + cont_features\n",
    "output.graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "proc = nvt.Workflow(cat_features + cont_features + labels[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We initialize a nvt.Dataset object from parquet dataset that was created in step 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train = nvt.Dataset(os.path.join(data_input_path, \"train.parquet\"), engine=\"parquet\",part_size=\"256MB\")\n",
    "data_valid = nvt.Dataset(os.path.join(data_input_path, \"valid.parquet\"), engine=\"parquet\",part_size=\"256MB\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict_dtypes={}\n",
    "\n",
    "for col in cat_features.columns:\n",
    "    dict_dtypes[col] = np.int64\n",
    "\n",
    "for col in cont_features.columns:\n",
    "    dict_dtypes[col] = np.float32\n",
    "\n",
    "for col in labels:\n",
    "    dict_dtypes[col] = np.float32"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's fit the workflow on our training dataset to record the statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "proc.fit(data_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "# For training set\n",
    "proc.transform(data_train).to_parquet(output_path=output_train_path,\n",
    "                                shuffle=nvt.io.Shuffle.PER_PARTITION,\n",
    "                                dtypes=dict_dtypes,\n",
    "                                out_files_per_proc=10,\n",
    "                                cats = cat_features.columns,\n",
    "                                conts = cont_features.columns,\n",
    "                                labels = labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "# For validation set\n",
    "proc.transform(data_valid).to_parquet(output_path=output_valid_path,\n",
    "                                shuffle=nvt.io.Shuffle.PER_PARTITION,\n",
    "                                dtypes=dict_dtypes,\n",
    "                                out_files_per_proc=10,\n",
    "                                cats = cat_features.columns,\n",
    "                                conts = cont_features.columns,\n",
    "                                labels = labels)\n",
    "rmm.reinitialize(managed_memory=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a quick look at the contents of _metadata.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(os.path.join(output_train_path, '_metadata.json'),'r') as f:\n",
    "    metadata = json.load(f)\n",
    "\n",
    "metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nvtabular.ops import get_embedding_sizes\n",
    "embeddings_count_encode =  get_embedding_sizes(proc)\n",
    "embeddings_count_encode"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reformatting the above output for ease of copy paste in HugeCTRs config.json\n",
    "\n",
    "embedding_size_str_count_encode = [embeddings_count_encode[x][0] for x in cat_features.columns]\n",
    "embedding_size_str_count_encode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have 2 versions of our dataset ready, one with time based features and other with count + target encoded features, we can start training a few DNNs using HugeCTR."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Train DNN with HugeCTR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we would be training Deep Learning Recommendation Model (DLRM) using HugeCTR's low level python API. We would also be using the inference python API for evaluation on the validation set.\n",
    "\n",
    "We would be training 2 models, one each for the 2 datasets that we've processed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train config json for simple time based features dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Python low level train API requires a train config.json with the arguments and definitions of various training parameters like - optimizer, iterations, neural architecture and dataset.<br>\n",
    "As a first step, we will develop this config file for our feature engineered dataset and DLRM model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define paths to save the config and the weights\n",
    "\n",
    "config_file_path = os.path.join(config_output_path,'dlrm_fp32_simple-time_1gpu.json')\n",
    "weights_output_path = os.path.join(weights_path,'dlrm_fp32_simple-time_1gpu/')\n",
    "\n",
    "# Creating Directory inside weights folder\n",
    "if os.path.isdir(weights_output_path):\n",
    "    shutil.rmtree(weights_output_path)\n",
    "os.mkdir(weights_output_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For DLRM model, lets consider the [dlrm_fp32_64k.json](https://github.com/NVIDIA/HugeCTR/tree/v2.3/samples/dlrm/) sample file that is available on HugeCTR's github repository, and modify it for our dataset.\n",
    "\n",
    "The parameters that we should modify are:  \n",
    "\n",
    "- solver:\n",
    "        - max_iter: Num. of samples / batch size\n",
    "        - gpu: List of GPU IDs to use for training\n",
    "        - batchsize: num. of samples to process in the batch training mode\n",
    "        - eval_interval: Num. of iterations after which evaluation should trigger on the validation set\n",
    "\n",
    "\n",
    "- optimizer:\n",
    "        - type: Adam \n",
    "        - learning_rate: 1e-4 (smaller value to begin with)\n",
    "\n",
    "\n",
    "- layers\n",
    "        - format: Parquet (since our dataset is in parquet)\n",
    "        - source and eval_source: Path to _file_list.txt for the train and eval dataset produced by NVTabular\n",
    "        - slot_num: For LocalizedSlot, set it to the number of categorical features\n",
    "        - max_feature_num_per_sample: For LocalizedSlot, this can be the same as slot_num\n",
    "        - slot_size_array: Cardinality of the categorical features (in the same order as column names in 'cats' dictionary of _metadata.json)\n",
    "        - embedding_vec_size: dimension of the embedding vectors for the categorical features\n",
    "        - label_dim: Labels dimension\n",
    "        - dense_dim: Number of dense/continous features\n",
    "        - sparse: Dimensions of categorical features\n",
    "        - DLRM layer fc3: Output dimension of fc3 should be the same as embedding_vec_size\n",
    "         \n",
    "We've developed one such train config.json below with the appropriate path to the dataset and default batch size for a 32GB GPU.<br>\n",
    "Let's make use of the data path and other variables we've defined in the steps above and re-define the ones which may have changed throughout the pre-processing step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Path to the simple time based feature processed dataset\n",
    "output_train_path = os.path.join(BASE_DIR, \"processed_nvt/train\")\n",
    "output_valid_path = os.path.join(BASE_DIR, \"processed_nvt/valid\")\n",
    "\n",
    "# Model related parameter\n",
    "embedding_vec_size = 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = {\n",
    "        \"type\": \"Adam\",\n",
    "        \"update_type\": \"Local\",\n",
    "        \"adam_hparam\": {\n",
    "            \"learning_rate\": 0.001,\n",
    "            \"beta1\": 0.9,\n",
    "            \"beta2\": 0.999,\n",
    "            \"epsilon\": 1e-07,\n",
    "            \"warmup_steps\": 10000,\n",
    "            \"decay_start\": 20000,\n",
    "            \"decay_steps\": 200000,\n",
    "            \"decay_power\": 1,\n",
    "            \"end_lr\": 0.000001\n",
    "        }\n",
    "    }\n",
    "\n",
    "layers = [\n",
    "        {\n",
    "            \"name\": \"data\",\n",
    "            \"type\": \"Data\",\n",
    "            \"format\": \"Parquet\",\n",
    "            \"slot_size_array\": embedding_size_str_simple_time,\n",
    "            \"source\": output_train_path+\"/_file_list.txt\",\n",
    "            \"eval_source\": output_valid_path+\"/_file_list.txt\",\n",
    "            \"check\": \"None\",\n",
    "            \"label\": {\n",
    "                \"top\": \"label\",\n",
    "                \"label_dim\": 1\n",
    "            },\n",
    "            \"dense\": {\n",
    "                \"top\": \"dense\",\n",
    "                \"dense_dim\": 1\n",
    "            },\n",
    "            \"sparse\": [\n",
    "                {\n",
    "                    \"top\": \"data1\",\n",
    "                    \"type\": \"LocalizedSlot\",\n",
    "                    \"max_feature_num_per_sample\": len(embeddings_simple_time),\n",
    "                    \"max_nnz\": 1,\n",
    "                    \"slot_num\": len(embeddings_simple_time)\n",
    "                }\n",
    "            ]\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"sparse_embedding1\",\n",
    "            \"type\": \"LocalizedSlotSparseEmbeddingHash\",\n",
    "            \"bottom\": \"data1\",\n",
    "            \"top\": \"sparse_embedding1\",\n",
    "            \"sparse_embedding_hparam\": {\n",
    "                \"slot_size_array\": embedding_size_str_simple_time,\n",
    "                \"embedding_vec_size\": embedding_vec_size,\n",
    "                \"combiner\": 0\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc1\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"dense\",\n",
    "            \"top\": \"fc1\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 512\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu1\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc1\",\n",
    "            \"top\": \"relu1\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc2\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu1\",\n",
    "            \"top\": \"fc2\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 256\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu2\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc2\",\n",
    "            \"top\": \"relu2\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc3\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu2\",\n",
    "            \"top\": \"fc3\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": embedding_vec_size\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu3\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc3\",\n",
    "            \"top\": \"relu3\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"interaction1\",\n",
    "            \"type\": \"Interaction\",\n",
    "            \"bottom\": [\n",
    "                \"relu3\",\n",
    "                \"sparse_embedding1\"\n",
    "            ],\n",
    "            \"top\": \"interaction1\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc4\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"interaction1\",\n",
    "            \"top\": \"fc4\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1024\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu4\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc4\",\n",
    "            \"top\": \"relu4\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc5\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu4\",\n",
    "            \"top\": \"fc5\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1024\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu5\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc5\",\n",
    "            \"top\": \"relu5\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc6\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu5\",\n",
    "            \"top\": \"fc6\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 512\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu6\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc6\",\n",
    "            \"top\": \"relu6\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc7\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu6\",\n",
    "            \"top\": \"fc7\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 256\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu7\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc7\",\n",
    "            \"top\": \"relu7\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc8\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu7\",\n",
    "            \"top\": \"fc8\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"loss\",\n",
    "            \"type\": \"BinaryCrossEntropyLoss\",\n",
    "            \"bottom\": [\n",
    "                \"fc8\",\n",
    "                \"label\"\n",
    "            ],\n",
    "            \"top\": \"loss\"\n",
    "        }\n",
    "    ]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "config = {\n",
    "    \"optimizer\": optimizer,\n",
    "    \"layers\": layers\n",
    "}\n",
    "\n",
    "with open(config_file_path,'w') as f:\n",
    "    json.dump(config,f,indent = 4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can start the training using the above config file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from hugectr import Session, solver_parser_helper,get_learning_rate_scheduler\n",
    "from mpi4py import MPI\n",
    "\n",
    "# Solver related parameters\n",
    "NUM_GPUS = [0]                                                     # GPUs used for training\n",
    "json_file = config_file_path                                       # Path to the json config file\n",
    "batchsize = 2048                                                   # Batch size used for training\n",
    "batchsize_eval = 2048                                              # Batch size used for evaluation\n",
    "max_eval_batches = 3768                                            # Iterations required to go through the complete validation set with the set batchsize_eval\n",
    "\n",
    "# Training related parameters\n",
    "num_iter = 30001                                                   # Iterations to train the model for\n",
    "eval_trigger = 10000                                               # Start evaluation after these iterations\n",
    "snapshot_trigger = 10000                                           # Save model checkpoints after these iterations\n",
    "\n",
    "solver_config = solver_parser_helper(\n",
    "                                    seed = 0,\n",
    "                                    batchsize = batchsize,                   # Minibatch size for training\n",
    "                                    batchsize_eval = batchsize_eval,         # Minibatch size for eval \n",
    "                                    max_eval_batches = max_eval_batches,     # Max no. of eval batches on which eval will be done\n",
    "                                    model_file = \"\",                         # Load any pretrained model , if training from scratch, leave empty\n",
    "                                    embedding_files = [],                    # Path to trained embedding table, if training from scratch then leave empty\n",
    "                                    vvgpu = [NUM_GPUS],                      # GPU Indices to be used ofr training\n",
    "                                    use_mixed_precision = False,             # Flag to indicate use of Mixed precision training \n",
    "                                    scaler = 1024,                           # To be set when MixedPrecisiontraining is ON\n",
    "                                    i64_input_key = True,                    # As we are using Parquet from NVTabular, I64 should be true \n",
    "                                    use_algorithm_search = False,            # Enable algo search within the fully connected-layers\n",
    "                                    use_cuda_graph = False,                  # Enable cuda graph for forward and back proppogation\n",
    "                                    repeat_dataset = True                    # Repeat the dataset for training, True for Non Epoch Based Training\n",
    "                                    )\n",
    "\n",
    "lr_sch = get_learning_rate_scheduler(json_file)                    # Get learning rate statistics from optimizers     \n",
    "\n",
    "sess = Session(solver_config, json_file)                           # Initialise a Session Object\n",
    "sess.start_data_reading()                                          # Start Data Reading\n",
    "\n",
    "for i in range(num_iter):                                          # Start training loop\n",
    "    lr = lr_sch.get_next()                                         # Update learning rate parameters                                   \n",
    "    sess.set_learning_rate(lr)                                     # Pass the updated learning rate to the session\n",
    "    sess.train()                                                   # Train on 1 iteration on 1 Minibatch\n",
    "\n",
    "    if (i%1000 == 0):\n",
    "        loss = sess.get_current_loss()                             # Returns the loss value for the current iteration.\n",
    "        print(\"[HUGECTR][INFO] iter: {}; loss: {:.6f}; lr: {:.6f}\".format(i, loss, lr))\n",
    "    if (i%eval_trigger == 0 and i != 0):\n",
    "        sess.check_overflow()                                      # Checks whether any embedding has encountered overflow\n",
    "        sess.copy_weights_for_evaluation()                         # Copies the weights of the dense network from training layers to evaluation layers.\n",
    "        for _ in range(solver_config.max_eval_batches):\n",
    "            sess.eval()                                            # Calculates the evaluation metrics based on one minibatch of evaluation data\n",
    "        metrics = sess.get_eval_metrics()                          # Returns the average evaluation metrics of several minibatches of evaluation data.\n",
    "        print(\"[HUGECTR][INFO] iter: {}, {}\".format(i, metrics))\n",
    "    if (i%snapshot_trigger == 0 and i != 0):\n",
    "        sess.download_params_to_files(weights_output_path , i)     # Saving model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train config json for count and target encoded features dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following the same methodology as done above, we will modify the DLRM config file for this version of the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define paths to save the config and the weights\n",
    "\n",
    "config_file_path = os.path.join(config_output_path, 'dlrm_fp32_count-target-encode_1gpu.json')\n",
    "weights_output_path = os.path.join(weights_path,'dlrm_fp32_count-target-encode_1gpu/')\n",
    "\n",
    "# Creating Directory inside weights folder\n",
    "if os.path.isdir(weights_output_path):\n",
    "    shutil.rmtree(weights_output_path)\n",
    "os.mkdir(weights_output_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Path to the simple time based feature processed dataset\n",
    "output_train_path = os.path.join(BASE_DIR, \"processed_ce-te/train\")\n",
    "output_valid_path = os.path.join(BASE_DIR, \"processed_ce-te/valid\")\n",
    "\n",
    "# Model related parameter\n",
    "embedding_vec_size = 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = {\n",
    "        \"type\": \"Adam\",\n",
    "        \"update_type\": \"Local\",\n",
    "        \"adam_hparam\": {\n",
    "            \"learning_rate\": 0.001,\n",
    "            \"beta1\": 0.9,\n",
    "            \"beta2\": 0.999,\n",
    "            \"epsilon\": 1e-07,\n",
    "            \"warmup_steps\": 10000,\n",
    "            \"decay_start\": 20000,\n",
    "            \"decay_steps\": 200000,\n",
    "            \"decay_power\": 1,\n",
    "            \"end_lr\": 1e-06\n",
    "        }\n",
    "    }\n",
    "\n",
    "layers = [\n",
    "        {\n",
    "            \"name\": \"data\",\n",
    "            \"type\": \"Data\",\n",
    "            \"format\": \"Parquet\",\n",
    "            \"slot_size_array\": embedding_size_str_count_encode,\n",
    "            \"source\": output_train_path+\"/_file_list.txt\",\n",
    "            \"eval_source\": output_valid_path+\"/_file_list.txt\",\n",
    "            \"check\": \"None\",\n",
    "            \"label\": {\n",
    "                \"top\": \"label\",\n",
    "                \"label_dim\": 1\n",
    "            },\n",
    "            \"dense\": {\n",
    "                \"top\": \"dense\",\n",
    "                \"dense_dim\": 9\n",
    "            },\n",
    "            \"sparse\": [\n",
    "                {\n",
    "                    \"top\": \"data1\",\n",
    "                    \"type\": \"LocalizedSlot\",\n",
    "                    \"max_feature_num_per_sample\": len(embeddings_count_encode),\n",
    "                    \"max_nnz\": 1,\n",
    "                    \"slot_num\": len(embeddings_count_encode)\n",
    "                }\n",
    "            ]\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"sparse_embedding1\",\n",
    "            \"type\": \"LocalizedSlotSparseEmbeddingHash\",\n",
    "            \"bottom\": \"data1\",\n",
    "            \"top\": \"sparse_embedding1\",\n",
    "            \"sparse_embedding_hparam\": {\n",
    "                \"slot_size_array\": embedding_size_str_count_encode,\n",
    "                \"embedding_vec_size\": embedding_vec_size,\n",
    "                \"combiner\": 0\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc1\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"dense\",\n",
    "            \"top\": \"fc1\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 512\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu1\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc1\",\n",
    "            \"top\": \"relu1\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc2\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu1\",\n",
    "            \"top\": \"fc2\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 256\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu2\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc2\",\n",
    "            \"top\": \"relu2\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc3\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu2\",\n",
    "            \"top\": \"fc3\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": embedding_vec_size\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu3\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc3\",\n",
    "            \"top\": \"relu3\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"interaction1\",\n",
    "            \"type\": \"Interaction\",\n",
    "            \"bottom\": [\n",
    "                \"relu3\",\n",
    "                \"sparse_embedding1\"\n",
    "            ],\n",
    "            \"top\": \"interaction1\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc4\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"interaction1\",\n",
    "            \"top\": \"fc4\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1024\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu4\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc4\",\n",
    "            \"top\": \"relu4\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc5\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu4\",\n",
    "            \"top\": \"fc5\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1024\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu5\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc5\",\n",
    "            \"top\": \"relu5\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc6\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu5\",\n",
    "            \"top\": \"fc6\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 512\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu6\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc6\",\n",
    "            \"top\": \"relu6\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc7\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu6\",\n",
    "            \"top\": \"fc7\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 256\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu7\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc7\",\n",
    "            \"top\": \"relu7\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc8\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu7\",\n",
    "            \"top\": \"fc8\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"loss\",\n",
    "            \"type\": \"BinaryCrossEntropyLoss\",\n",
    "            \"bottom\": [\n",
    "                \"fc8\",\n",
    "                \"label\"\n",
    "            ],\n",
    "            \"top\": \"loss\"\n",
    "        }\n",
    "    ]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "config = {\n",
    "    \"optimizer\": optimizer,\n",
    "    \"layers\": layers\n",
    "}\n",
    "with open(config_file_path,'w') as f:\n",
    "    json.dump(config,f,indent = 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solver related parameters\n",
    "NUM_GPUS = [0]                                               # GPUs used for training\n",
    "json_file = config_file_path                                       # Path to the json config file\n",
    "batchsize = 2048                                                   # Batch size used for training\n",
    "batchsize_eval = 2048                                              # Batch size used during evaluation\n",
    "max_eval_batches = 3768                                            # Iterations required to go through the complete validation set with the set batchsize_eval\n",
    "\n",
    "# Training related parameters\n",
    "num_iter = 30001                                                   # Iterations to train the model for\n",
    "eval_trigger = 10000                                               # Start evaluation after these iterations\n",
    "snapshot_trigger = 10000                                           # Save model checkpoints after these iterations\n",
    "\n",
    "solver_config = solver_parser_helper(\n",
    "                                    seed = 0,\n",
    "                                    batchsize = batchsize,                       # Minibatch size for training\n",
    "                                    batchsize_eval = batchsize_eval,         # Minibatch size for eval \n",
    "                                    max_eval_batches = max_eval_batches,     # Max no. of eval batches on which eval will be done\n",
    "                                    model_file = \"\",                         # Load any pretrained model , if training from scratch, leave empty\n",
    "                                    embedding_files = [],                    # Path to trained embedding table, if training from scratch then leave empty\n",
    "                                    vvgpu = [NUM_GPUS],                      # GPU Indices to be used ofr training\n",
    "                                    use_mixed_precision = False,             # Flag to indicate use of Mixed precision training \n",
    "                                    scaler = 1024,                           # To be set when MixedPrecisiontraining is ON\n",
    "                                    i64_input_key = True,                    # As we are using Parquet from NVTabular, I64 should be true \n",
    "                                    use_algorithm_search = False,            # Enable algo search within the fully connected-layers\n",
    "                                    use_cuda_graph = False,                  # Enable cuda graph for forward and back proppogation\n",
    "                                    repeat_dataset = True                    # Repeat the dataset for training, True for Non Epoch Based Training\n",
    "                                    )\n",
    "\n",
    "lr_sch = get_learning_rate_scheduler(json_file)                    # Get learning rate statistics from optimizers     \n",
    "\n",
    "sess = Session(solver_config, json_file)                           # Initialise a Session Object\n",
    "sess.start_data_reading()                                          # Start Data Reading\n",
    "\n",
    "for i in range(num_iter):                                          # Start training loop\n",
    "    lr = lr_sch.get_next()                                         # Update learning rate parameters                                   \n",
    "    sess.set_learning_rate(lr)                                     # Pass the updated learning rate to the session\n",
    "    sess.train()                                                   # Train on 1 iteration on 1 Minibatch\n",
    "\n",
    "    if (i%1000 == 0):\n",
    "        loss = sess.get_current_loss()                             # Returns the loss value for the current iteration.\n",
    "        print(\"[HUGECTR][INFO] iter: {}; loss: {:.6f}; lr: {:.6f}\".format(i, loss, lr))\n",
    "    if (i%eval_trigger == 0 and i != 0):\n",
    "        sess.check_overflow()                                      # Checks whether any embedding has encountered overflow\n",
    "        sess.copy_weights_for_evaluation()                         # Copies the weights of the dense network from training layers to evaluation layers.\n",
    "        for _ in range(solver_config.max_eval_batches):\n",
    "            sess.eval()                                            # Calculates the evaluation metrics based on one minibatch of evaluation data\n",
    "        metrics = sess.get_eval_metrics()                          # Returns the average evaluation metrics of several minibatches of evaluation data.\n",
    "        print(\"[HUGECTR][INFO] iter: {}, {}\".format(i, metrics))\n",
    "    if (i%snapshot_trigger == 0 and i != 0):\n",
    "        sess.download_params_to_files(weights_output_path , i)     # Saving model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 8: Inference on 1st validation set with HugeCTR\n",
    "\n",
    "After training 2 DLRM models, let's evaluate them on the validation set using HugeCTR's python inference API. The evaluation metric is AUC.<br>\n",
    "We will first write the inference config file, then prepare the validation data into CSR format and finally use the inference APIs to get the predictions.\n",
    "\n",
    "Let's start with the first trained model i.e. DLRM trained on simple time based features. In the next step, we would repeat the same process for the second trained model. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Infer config json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "We need to first prepare a inference json file. The important modifications required in the train config file are:\n",
    "\n",
    "- Omit the `optimizer` and `solver` clauses, while adding and `inference` clause\n",
    "- Change the output layer to `Sigmoid` type. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first write the infer config file for the simple time based feature dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "config_inference_file_path = os.path.join(config_output_path,'dlrm_fp32_simple-time_1gpu_inference.json')\n",
    "weights_output_path = os.path.join(weights_path,'dlrm_fp32_simple-time_1gpu/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Adding the path to the dense and sparse model checkpoints from the weights directory\n",
    "inference = {\n",
    "    \"max_batchsize\": 2048,\n",
    "    \"dense_model_file\": weights_output_path+\"/_dense_30000.model\",\n",
    "    \"sparse_model_file\": weights_output_path+\"/0_sparse_30000.model\",\n",
    "    \"input_key_type\": \"I64\"\n",
    "  }\n",
    "\n",
    "layers = [\n",
    "      {\n",
    "            \"name\": \"data\",\n",
    "            \"type\": \"Data\",\n",
    "            \"check\": \"None\",\n",
    "            \"label\": {\n",
    "                \"top\": \"label\",\n",
    "                \"label_dim\": 1\n",
    "            },\n",
    "            \"dense\": {\n",
    "                \"top\": \"dense\",\n",
    "                \"dense_dim\": 1\n",
    "            },\n",
    "            \"sparse\": [\n",
    "                {\n",
    "                    \"top\": \"data1\",\n",
    "                    \"type\": \"LocalizedSlot\",\n",
    "                    \"max_feature_num_per_sample\": len(embedding_size_str_simple_time),\n",
    "                    \"max_nnz\": 1,\n",
    "                    \"slot_num\": len(embedding_size_str_simple_time)\n",
    "                }\n",
    "            ]\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"sparse_embedding1\",\n",
    "            \"type\": \"LocalizedSlotSparseEmbeddingHash\",\n",
    "            \"bottom\": \"data1\",\n",
    "            \"top\": \"sparse_embedding1\",\n",
    "            \"sparse_embedding_hparam\": {\n",
    "                \"slot_size_array\": embedding_size_str_simple_time,\n",
    "                \"embedding_vec_size\": embedding_vec_size,\n",
    "                \"combiner\": 0\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc1\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"dense\",\n",
    "            \"top\": \"fc1\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 512\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu1\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc1\",\n",
    "            \"top\": \"relu1\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc2\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu1\",\n",
    "            \"top\": \"fc2\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 256\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu2\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc2\",\n",
    "            \"top\": \"relu2\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc3\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu2\",\n",
    "            \"top\": \"fc3\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": embedding_vec_size\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu3\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc3\",\n",
    "            \"top\": \"relu3\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"interaction1\",\n",
    "            \"type\": \"Interaction\",\n",
    "            \"bottom\": [\n",
    "                \"relu3\",\n",
    "                \"sparse_embedding1\"\n",
    "            ],\n",
    "            \"top\": \"interaction1\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc4\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"interaction1\",\n",
    "            \"top\": \"fc4\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1024\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu4\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc4\",\n",
    "            \"top\": \"relu4\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc5\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu4\",\n",
    "            \"top\": \"fc5\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1024\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu5\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc5\",\n",
    "            \"top\": \"relu5\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc6\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu5\",\n",
    "            \"top\": \"fc6\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 512\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu6\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc6\",\n",
    "            \"top\": \"relu6\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc7\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu6\",\n",
    "            \"top\": \"fc7\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 256\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu7\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc7\",\n",
    "            \"top\": \"relu7\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc8\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu7\",\n",
    "            \"top\": \"fc8\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1\n",
    "            }\n",
    "        },\n",
    "\n",
    "      {\n",
    "        \"name\": \"sigmoid\",\n",
    "        \"type\": \"Sigmoid\",\n",
    "        \"bottom\": [\"fc8\"],\n",
    "        \"top\": \"sigmoid\"\n",
    "      }\n",
    "    ]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "config = {\n",
    "    \"inference\": inference,\n",
    "    \"layers\": layers\n",
    "}\n",
    "\n",
    "with open(config_inference_file_path,'w') as f:\n",
    "    json.dump(config,f,indent = 4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare validation set for inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "output_valid_path = os.path.join(BASE_DIR, \"processed_nvt/valid\")\n",
    "\n",
    "nvtdata_test = pd.read_parquet(output_valid_path)\n",
    "nvtdata_test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "con_feats = ['hist_count']\n",
    "\n",
    "cat_feats = ['time_hour',\n",
    " 'hist_cat_0',\n",
    " 'hist_subcat_0',\n",
    " 'hist_cat_1',\n",
    " 'hist_subcat_1',\n",
    " 'hist_cat_2',\n",
    " 'hist_subcat_2',\n",
    " 'hist_cat_3',\n",
    " 'hist_subcat_3',\n",
    " 'hist_cat_4',\n",
    " 'hist_subcat_4',\n",
    " 'hist_cat_5',\n",
    " 'hist_subcat_5',\n",
    " 'hist_cat_6',\n",
    " 'hist_subcat_6',\n",
    " 'hist_cat_7',\n",
    " 'hist_subcat_7',\n",
    " 'hist_cat_8',\n",
    " 'hist_subcat_8',\n",
    " 'hist_cat_9',\n",
    " 'hist_subcat_9',\n",
    " 'impr_cat',\n",
    " 'impr_subcat',\n",
    " 'impression_id',\n",
    " 'uid',\n",
    " 'time_minute',\n",
    " 'time_second',\n",
    " 'time_wd',\n",
    " 'time_day',\n",
    " 'time_day_week',\n",
    " 'time']\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For inference, HugeCTR expects the data to conform to CSR format which mandates the categorical variables to occupy different integer ranges.<br>\n",
    "As an example, if there are 10 users and 10 items then HugeCTR expects the users to be encoded in the 1-10 range, while the items to be encoded in the 11-20 range. NVTabular encodes both users and items in the 1-10 ranges.\n",
    "\n",
    "For this reason, we need to shift the keys of the categorical variable produced by NVTabular to comply with HugeCTR."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "shift = np.insert(np.cumsum(embedding_size_str_simple_time), 0, 0)[:-1]\n",
    "cat_data = nvtdata_test[cat_feats].values + shift\n",
    "dense_data = nvtdata_test[con_feats].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create inference session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from mpi4py import MPI\n",
    "from hugectr.inference import CreateParameterServer, CreateEmbeddingCache, InferenceSession\n",
    "\n",
    "# Declare parameter server, embedding cache and inference session using inference config\n",
    "parameter_server = CreateParameterServer([config_inference_file_path],[\"DLRM\"],True)\n",
    "embedding_cache = CreateEmbeddingCache(parameter_server,0,True,0.2,config_inference_file_path,\"DLRM\",True)\n",
    "inference_session = InferenceSession(config_inference_file_path,0,embedding_cache)\n",
    "\n",
    "# Define a function to perform batched inference\n",
    "def infer_batch(inference_session, dense_data_batch, cat_data_batch):\n",
    "    dense_features = list(dense_data_batch.flatten())\n",
    "    embedding_columns = list(cat_data_batch.flatten())\n",
    "\n",
    "    row_ptrs= list(range(0,len(embedding_columns)+1))\n",
    "    \n",
    "    output = inference_session.predict(dense_features, embedding_columns, row_ptrs, True)\n",
    "    return output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are ready to carry out inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_size = 2048\n",
    "num_batches = (len(dense_data) // batch_size) + 1\n",
    "batch_idx = np.array_split(np.arange(len(dense_data)), num_batches)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = []\n",
    "\n",
    "for batch_id in tqdm(batch_idx):\n",
    "    dense_data_batch = dense_data[batch_id]\n",
    "    cat_data_batch = cat_data[batch_id]\n",
    "    results = infer_batch(inference_session, dense_data_batch, cat_data_batch)\n",
    "    labels.extend(results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract ground truth to calculate AUC\n",
    "ground_truth = nvtdata_test['label'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "roc_auc_score(ground_truth, labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 9: Inference on 2nd validation set with HugeCTR\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following the same procedure as in the last step, let's compute the AUC on the count plus target encoded feature engineered validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "config_inference_file_path = os.path.join(config_output_path, 'dlrm_fp32_count-target-encode_1gpu_inference.json')\n",
    "weights_output_path = os.path.join(weights_path,'dlrm_fp32_count-target-encode_1gpu/')\n",
    "\n",
    "inference = {\n",
    "    \"max_batchsize\": 2048,\n",
    "    \"dense_model_file\": weights_output_path+\"/_dense_30000.model\",\n",
    "    \"sparse_model_file\": weights_output_path+\"/0_sparse_30000.model\",\n",
    "    \"input_key_type\": \"I64\"\n",
    "  }\n",
    "\n",
    "layers = [\n",
    "      {\n",
    "            \"name\": \"data\",\n",
    "            \"type\": \"Data\",\n",
    "            \"check\": \"None\",\n",
    "            \"label\": {\n",
    "                \"top\": \"label\",\n",
    "                \"label_dim\": 1\n",
    "            },\n",
    "            \"dense\": {\n",
    "                \"top\": \"dense\",\n",
    "                \"dense_dim\": 9\n",
    "            },\n",
    "            \"sparse\": [\n",
    "                {\n",
    "                    \"top\": \"data1\",\n",
    "                    \"type\": \"LocalizedSlot\",\n",
    "                    \"max_feature_num_per_sample\": len(embedding_size_str_count_encode),\n",
    "                    \"max_nnz\": 1,\n",
    "                    \"slot_num\": len(embedding_size_str_count_encode)\n",
    "                }\n",
    "            ]\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"sparse_embedding1\",\n",
    "            \"type\": \"LocalizedSlotSparseEmbeddingHash\",\n",
    "            \"bottom\": \"data1\",\n",
    "            \"top\": \"sparse_embedding1\",\n",
    "            \"sparse_embedding_hparam\": {\n",
    "                \"slot_size_array\": embedding_size_str_count_encode,\n",
    "                \"embedding_vec_size\": embedding_vec_size,\n",
    "                \"combiner\": 0\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc1\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"dense\",\n",
    "            \"top\": \"fc1\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 512\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu1\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc1\",\n",
    "            \"top\": \"relu1\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc2\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu1\",\n",
    "            \"top\": \"fc2\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 256\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu2\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc2\",\n",
    "            \"top\": \"relu2\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc3\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu2\",\n",
    "            \"top\": \"fc3\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": embedding_vec_size\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu3\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc3\",\n",
    "            \"top\": \"relu3\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"interaction1\",\n",
    "            \"type\": \"Interaction\",\n",
    "            \"bottom\": [\n",
    "                \"relu3\",\n",
    "                \"sparse_embedding1\"\n",
    "            ],\n",
    "            \"top\": \"interaction1\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc4\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"interaction1\",\n",
    "            \"top\": \"fc4\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1024\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu4\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc4\",\n",
    "            \"top\": \"relu4\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc5\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu4\",\n",
    "            \"top\": \"fc5\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1024\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu5\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc5\",\n",
    "            \"top\": \"relu5\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc6\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu5\",\n",
    "            \"top\": \"fc6\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 512\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu6\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc6\",\n",
    "            \"top\": \"relu6\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc7\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu6\",\n",
    "            \"top\": \"fc7\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 256\n",
    "            }\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"relu7\",\n",
    "            \"type\": \"ReLU\",\n",
    "            \"bottom\": \"fc7\",\n",
    "            \"top\": \"relu7\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"fc8\",\n",
    "            \"type\": \"InnerProduct\",\n",
    "            \"bottom\": \"relu7\",\n",
    "            \"top\": \"fc8\",\n",
    "            \"fc_param\": {\n",
    "                \"num_output\": 1\n",
    "            }\n",
    "        },\n",
    "\n",
    "      {\n",
    "        \"name\": \"sigmoid\",\n",
    "        \"type\": \"Sigmoid\",\n",
    "        \"bottom\": [\"fc8\"],\n",
    "        \"top\": \"sigmoid\"\n",
    "      }\n",
    "    ]\n",
    "\n",
    "\n",
    "config = {\n",
    "    \"inference\": inference,\n",
    "    \"layers\": layers\n",
    "}\n",
    "\n",
    "with open(config_inference_file_path,'w') as f:\n",
    "    json.dump(config,f,indent = 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output_valid_path = os.path.join(BASE_DIR, \"processed_ce-te/valid\")\n",
    "\n",
    "nvtdata_test = pd.read_parquet(output_valid_path)\n",
    "nvtdata_test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "con_feats = [\n",
    " 'TE_hist_cat_0_hist_cat_1_hist_cat_2_hist_cat_3_hist_cat_4_impr_cat_label_TE',\n",
    " 'TE_hist_cat_1_hist_cat_2_hist_cat_3_hist_cat_4_hist_cat_5_impr_cat_label_TE',\n",
    " 'TE_hist_cat_2_hist_cat_3_hist_cat_4_hist_cat_5_hist_cat_6_impr_cat_label_TE',\n",
    " 'TE_hist_cat_3_hist_cat_4_hist_cat_5_hist_cat_6_hist_cat_7_impr_cat_label_TE',\n",
    " 'TE_hist_cat_4_hist_cat_5_hist_cat_6_hist_cat_7_hist_cat_8_impr_cat_label_TE',\n",
    " 'TE_hist_cat_5_hist_cat_6_hist_cat_7_hist_cat_8_hist_cat_9_impr_cat_label_TE',\n",
    " 'hist_count',\n",
    " 'impr_cat_count',\n",
    " 'impr_subcat_count']\n",
    "\n",
    "cat_feats = ['time',\n",
    " 'hist_cat_0',\n",
    " 'hist_subcat_0',\n",
    " 'hist_cat_1',\n",
    " 'hist_subcat_1',\n",
    " 'hist_cat_2',\n",
    " 'hist_subcat_2',\n",
    " 'hist_cat_3',\n",
    " 'hist_subcat_3',\n",
    " 'hist_cat_4',\n",
    " 'hist_subcat_4',\n",
    " 'hist_cat_5',\n",
    " 'hist_subcat_5',\n",
    " 'hist_cat_6',\n",
    " 'hist_subcat_6',\n",
    " 'hist_cat_7',\n",
    " 'hist_subcat_7',\n",
    " 'hist_cat_8',\n",
    " 'hist_subcat_8',\n",
    " 'hist_cat_9',\n",
    " 'hist_subcat_9',\n",
    " 'impr_cat',\n",
    " 'impr_subcat',\n",
    " 'impression_id',\n",
    " 'uid',\n",
    " 'time_hour',\n",
    " 'time_minute',\n",
    " 'time_second',\n",
    " 'time_wd',\n",
    " 'time_day',\n",
    " 'time_day_week']\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "shift = np.insert(np.cumsum(embedding_size_str_count_encode), 0, 0)[:-1]\n",
    "cat_data = nvtdata_test[cat_feats].values + shift\n",
    "\n",
    "dense_data = nvtdata_test[con_feats].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create parameter server, embedding cache and inference session\n",
    "parameter_server = CreateParameterServer([config_inference_file_path],[\"DLRM\"],True)\n",
    "embedding_cache = CreateEmbeddingCache(parameter_server,0,True,0.2,config_inference_file_path,\"DLRM\",True)\n",
    "inference_session = InferenceSession(config_inference_file_path,0,embedding_cache)\n",
    "\n",
    "# Define a function to perform batched inference\n",
    "\n",
    "def infer_batch(inference_session, dense_data_batch, cat_data_batch):\n",
    "    dense_features = list(dense_data_batch.flatten())\n",
    "    embedding_columns = list(cat_data_batch.flatten())\n",
    "\n",
    "    row_ptrs= list(range(0,len(embedding_columns)+1))\n",
    "    \n",
    "    output = inference_session.predict(dense_features, embedding_columns, row_ptrs, True)\n",
    "    return output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are ready to carry out inference on the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_size = 2048\n",
    "num_batches = (len(dense_data) // batch_size) + 1\n",
    "batch_idx = np.array_split(np.arange(len(dense_data)), num_batches)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = []\n",
    "for batch_id in tqdm(batch_idx):\n",
    "    dense_data_batch = dense_data[batch_id]\n",
    "    cat_data_batch = cat_data[batch_id]\n",
    "    results = infer_batch(inference_session, dense_data_batch, cat_data_batch)\n",
    "    labels.extend(results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract ground truth to calculate AUC\n",
    "\n",
    "ground_truth = nvtdata_test['label'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "roc_auc_score(ground_truth, labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial notebook, we have walked through the process of data cleaning, pre-processing, feature engineering to model training and inferencing, all using the Merlin framework. We hope that this notebook would be helpful for building Recommendation Systems on your datasets as well.\n",
    "\n",
    "Feel free to experiment with the various hyper-parameters on the feature engineering and model training side and share your results!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
